Outlier Test_v1

Outlier Detection Methods - Mathematical Formulas

1. Interquartile Range (IQR) Method

Definition: $$IQR = Q_3 - Q_1$$

Where:

$Q_1$ = First quartile (25th percentile)
$Q_3$ = Third quartile (75th percentile)

Outlier Boundaries: $$\text{Lower Bound} = Q_1 - 1.5 \times IQR$$ $$\text{Upper Bound} = Q_3 + 1.5 \times IQR$$

Outlier Criterion: A data point $x$ is classified as an outlier if: $$x < Q_1 - 1.5 \times IQR \quad \text{or} \quad x > Q_3 + 1.5 \times IQR$$

Pass/Fail Criterion: $$\text{Outlier Percentage} = \frac{\text{Number of Outliers}}{\text{Total Data Points}} \times 100%$$

Station PASSES if: Outlier Percentage < 10%
Station FAILS if: Outlier Percentage ≥ 10%

Step-by-Step Outlier Detection Analysis

Overview

This analysis performs outlier detection on precipitation data from 56 rainfall stations using the Interquartile Range (IQR) method. The goal is to identify stations with excessive outliers that may indicate data quality issues.

Analysis Steps

Step 1: Data Loading and Inspection

Load precipitation data from Excel file
Check dataset dimensions and structure
Preview the first few rows to understand the data format
Expected format: Date column + 56 station columns with precipitation values

Step 2: Understanding the IQR Method

Mathematical foundation defined using quartiles (Q1, Q3)
Outlier boundaries: Q1 - 1.5×IQR (lower) and Q3 + 1.5×IQR (upper)
Pass/Fail Criterion: Stations with <10% outliers PASS, ≥10% FAIL

Step 3: Outlier Detection Implementation

Apply IQR method to each station's precipitation data
Calculate outlier counts, percentages, and boundary values
Classify stations as PASSED or FAILED based on 10% threshold
Generate summary statistics for all stations

Step 4: Export Results

Save outlier detection results to Excel file
Include: station names, outlier counts, percentages, bounds, and pass/fail status

Step 5: Visualization

Create boxplots for all 56 stations (split across 4 figures)
Each plot shows: data distribution, outliers (points beyond whiskers), and outlier statistics
Visual inspection helps identify patterns and validate numerical results

Expected Outcomes

Identification of stations with reliable data quality (< 10% outliers)
Detection of problematic stations requiring further investigation
Visual and numerical evidence for data quality assessment

Python

import pandas as pd
import numpy as np
from scipy import stats

import matplotlib.pyplot as plt

Python

file_source = r"\Eligibility Test\Data\grids_precipitation_data.xlsx"

Python

# Load the data
df = pd.read_excel(file_source)

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nColumn names:\n{df.columns.tolist()}")
print(f"\nFirst few rows:\n{df.head()}")

Python

# Function to detect outliers using IQR method
def detect_outliers_iqr(data):
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = (data < lower_bound) | (data > upper_bound)
    return outliers, lower_bound, upper_bound

# Apply outlier detection for each station (columns 1 to 56)
outlier_summary = {}

for col in df.columns:
    if col != 'Date':
        outliers_iqr, lower, upper = detect_outliers_iqr(df[col].dropna())
        outlier_summary[col] = {
            'total_outliers_iqr': outliers_iqr.sum(),
            'lower_bound': lower,
            'upper_bound': upper,
            'outlier_percentage': (outliers_iqr.sum() / len(df[col].dropna())) * 100
        }

# Display outlier summary
outlier_df = pd.DataFrame(outlier_summary).T
print(f"\nOutlier Summary:\n{outlier_df}")

# Define acceptable threshold for outlier percentage
acceptable_threshold = 10.0  # 10% threshold

# Check which stations pass the outlier test
outlier_df['passed_test'] = outlier_df['outlier_percentage'] < acceptable_threshold

# Display stations that passed and failed
print(f"\nStations that PASSED the outlier test (< {acceptable_threshold}%):")
print(outlier_df[outlier_df['passed_test']].index.tolist())

print(f"\nStations that FAILED the outlier test (>= {acceptable_threshold}%):")
print(outlier_df[~outlier_df['passed_test']].index.tolist())

print(f"\nSummary: {outlier_df['passed_test'].sum()} out of {len(outlier_df)} stations passed the test")

Python

# Save the outlier test results to Excel
output_file = r"\outlier_test_results.xlsx"

# Create a copy of outlier_df and reset index to include station names as a column
outlier_results = outlier_df.copy()
outlier_results.insert(0, 'Station', outlier_results.index)

# Save to Excel
outlier_results.to_excel(output_file, index=False)

print(f"Outlier test results saved to: {output_file}")

Python

# Create boxplots for all stations with outlier information
# Split into 4 separate figures for better visibility
n_stations = len([col for col in df.columns if col != 'Date'])
stations_per_fig = 14  # 14 stations per figure (2 rows x 7 columns)
station_cols = [col for col in df.columns if col != 'Date']

for fig_num in range(4):
    start_idx = fig_num * stations_per_fig
    end_idx = min(start_idx + stations_per_fig, n_stations)
    
    n_plots = end_idx - start_idx
    n_rows = 2
    n_cols = 7
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 8))
    axes = axes.flatten()
    
    # Plot stations for this figure
    for i in range(n_plots):
        col = station_cols[start_idx + i]
        axes[i].boxplot(df[col].dropna(), vert=True)
        outlier_pct = outlier_df.loc[col, 'outlier_percentage']
        outlier_count = int(outlier_df.loc[col, 'total_outliers_iqr'])
        axes[i].set_title(f'Station {col}\nOutliers: {outlier_count} ({outlier_pct:.2f}%)', fontsize=9)
        axes[i].set_ylabel('Precipitation (mm)')
        axes[i].grid(True, alpha=0.3)
    
    # Hide unused subplots
    for i in range(n_plots, n_rows * n_cols):
        axes[i].axis('off')
    
    plt.tight_layout()
    plt.show()